This is my PM566 Final Project website.

Introduction

Data Background

This dataset came from the National Vital Statistics System and covered heart disease mortality in the US during 2014. The data were collected at the county level. The basic information about this dataset was as follows:

  • 2013 to 2015, 3-year average. Rates are age-standardized. County rates are spatially smoothed. The data can be viewed by gender and race/ethnicity. Data source: National Vital Statistics System. Additional data, maps, and methodology can be viewed on the Interactive Atlas of Heart Disease and Stroke: http://www.cdc.gov/dhdsp/maps/atlas

Main question: How were gender and race associated with the heart disease death rate in CA during 2014?

Sub-questions

  • What was the association between gender and heart disease death rate in California?
  • What was the association between race and heart disease death rate in California?
  • Which counties had a relatively higher heart disease death rate within the gender stratification?
  • Which counties had a relatively higher heart disease death rate within the race stratification?

Methods

The data were obtained from the CDC Chronic Disease and Health Promotion Data & Indicators portal: https://chronicdata.cdc.gov/Heart-Disease-Stroke-Prevention/Heart-Disease-Mortality-Data-Among-US-Adults-35-by/i2vk-mgdh

Data variables included:

  • Year: Center of 3-year average
  • LocationAbbr: State, Territory, or US postal abbreviation
  • LocationDesc: county name
  • GeographicLevel: county/state
  • DataSource
  • Class: Cardiovascular Diseases
  • Topic: Heart Disease Mortality
  • Data_Value: heart disease death rate
  • Data_Value_Unit: per 100,000 population
  • Data_Value_Type: Age-adjusted, Spatially Smoothed, 3-year Average Rate
  • Data_Value_Footnote_Symbol
  • Data_Value_Footnote
  • StratificationCategory1: gender
  • Stratification1: gender categories
  • StratificationCategory2: race
  • Stratification2: race categories (White, Black, Hispanic, Asian and Pacific Islander, American Indian and Alaskan Native)
  • TopicID
  • LocationID
  • FIPS code
  • Location 1: latitude and longitude

# load R packages
library(gsubfn)
library(data.table)
library(dplyr)
library(dtplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(leaflet)
library(sf)
library(raster)
library(plotly)
library(rjson)
# download and read in the data
fn <- "Heart_Disease_Mortality_Data_Among_US_Adults__35___by_State_Territory_and_County.csv"
if (!file.exists(fn)) {
  # download.file() has no timeout argument; the limit is set through options()
  options(timeout = 60)
  download.file("https://chronicdata.cdc.gov/api/views/i2vk-mgdh/rows.csv?accessType=DOWNLOAD",
                destfile = fn,
                method = "libcurl")
}
heartdisease <- data.table::fread(fn)
# check the dimensions and whether NAs exist
knitr::kable(dim(heartdisease))
x
59076
19
knitr::kable(summary(is.na(heartdisease)))
Year LocationAbbr LocationDesc GeographicLevel DataSource Class Topic Data_Value Data_Value_Unit Data_Value_Type Data_Value_Footnote_Symbol Data_Value_Footnote StratificationCategory1 Stratification1 StratificationCategory2 Stratification2 TopicID LocationID Location 1
Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:32149 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076 FALSE:59076
NA NA NA NA NA NA NA TRUE :26927 NA NA NA NA NA NA NA NA NA NA NA

Based on the summary table, only Data_Value contained NAs (26,927 of 59,076 rows), which indicated insufficient data. I decided to replace the NAs with 0 for later convenience.

# replace NAs with 0
heartdisease$Data_Value <- heartdisease$Data_Value %>% replace_na(0)

After the replacement, Data_Value no longer contained any NAs.
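One way to confirm the replacement, and to keep track of which rows were originally missing, is sketched below on a small made-up data frame. The toy values and the `insufficient` column are illustrative assumptions and are not part of the original analysis.

```r
# Toy data standing in for heartdisease; the values are made up for illustration
toy <- data.frame(
  LocationDesc = c("A County", "B County", "C County"),
  Data_Value   = c(310.5, NA, 275.2)
)

# Hypothetical helper column: flag rows with insufficient data before overwriting them
toy$insufficient <- is.na(toy$Data_Value)

# Base R equivalent of replace_na(0)
toy$Data_Value[is.na(toy$Data_Value)] <- 0

# Confirm that no NAs remain and count how many rows were filled with 0
n_remaining <- sum(is.na(toy$Data_Value))
n_filled    <- sum(toy$insufficient)
print(c(remaining = n_remaining, filled = n_filled))
```

Keeping a flag like this makes it possible to tell a 0 placeholder from a real rate later on.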

Based on the main question, county-level California data were selected.

# select data in California
heartdisease_CA <- heartdisease[LocationAbbr == 'CA' & GeographicLevel == 'County']

Location 1 contained latitude and longitude in a single column, so it was more convenient to separate them into two columns.

# remove "()" in strings
heartdisease_CA$`Location 1` <- gsub("[()]", "", heartdisease_CA$`Location 1`)
# separate lat and lon variables
heartdisease_CA <- heartdisease_CA %>%
  separate(col = 'Location 1', into=c('lat', 'lon'), sep=',')

Data_Value, lat and lon were then converted to numeric class.

# convert chr to num
heartdisease_CA$Data_Value <- as.numeric(heartdisease_CA$Data_Value)
heartdisease_CA$lat <- as.numeric(heartdisease_CA$lat)
heartdisease_CA$lon <- as.numeric(heartdisease_CA$lon)
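The cleaning steps above can be illustrated with base R on two made-up coordinate strings. This is only a sketch under the assumption that Location 1 follows a "(lat, lon)" layout; the coordinates below are placeholders, not rows from the dataset.

```r
# Made-up coordinate strings in the assumed "(lat, lon)" layout of Location 1
loc <- c("(36.7783, -119.4179)", "(34.0522, -118.2437)")

# Strip the parentheses, then split each string on the comma
clean <- gsub("[()]", "", loc)
parts <- strsplit(clean, ",")

# as.numeric() tolerates the leading space left after the comma
lat <- as.numeric(sapply(parts, `[`, 1))
lon <- as.numeric(sapply(parts, `[`, 2))

coords <- data.frame(lat = lat, lon = lon)
print(coords)
```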

CA_gender contained the heart disease mortality data based on gender category. CA_race contained the heart disease mortality data based on race category. CA_overall contained the data without any stratification.

# select data under each stratification
CA_gender <- heartdisease_CA[Stratification1 != 'Overall' & Stratification2 == 'Overall']
CA_race <- heartdisease_CA[Stratification2 != 'Overall' & Stratification1 == 'Overall']
CA_overall <- heartdisease_CA[Stratification1 == 'Overall' & Stratification2 == 'Overall']

Since there are 58 counties in CA in total, the size of the dataset seemed reasonable.
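A check of this kind can be sketched as follows. The toy data frame stands in for CA_gender and its values are made up; with the real California data, 58 distinct counties would be expected.

```r
# Toy stand-in for CA_gender: one row per county and gender (made-up values)
toy_gender <- data.frame(
  LocationDesc    = rep(c("A County", "B County", "C County"), each = 2),
  Stratification1 = rep(c("Male", "Female"), times = 3),
  Data_Value      = c(400, 260, 380, 250, 0, 0)
)

# Number of distinct counties in the subset
n_counties <- length(unique(toy_gender$LocationDesc))

# Rows per gender category; each count should equal the number of counties
rows_per_gender <- table(toy_gender$Stratification1)

print(n_counties)
print(rows_per_gender)
```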

Preliminary Results

# create histogram to find association between gender and death rate
p1 <- ggplot(CA_gender, mapping = aes(x = Data_Value)) + 
    geom_histogram(mapping = aes(fill = Stratification1), bins = 30) +
    scale_fill_brewer(palette = "BuPu") +
    labs(
      x = "Death rate per 100,000 population",
      y = "Count",
      title = "Histogram of death rate by gender in CA")
ggplotly(p1)
p1 <- NULL

Histograms of the death rate for males and females were constructed, with insufficient data represented by 0. In the graph, the distributions of heart disease death rate for both males and females were slightly skewed to the left. However, the female distribution lay further to the left than the male distribution, and only a small portion overlapped. This indicated that females generally had a lower heart disease death rate than males in CA during 2014.
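The comparison read from the histogram could also be expressed as a summary statistic per gender. The sketch below uses made-up values and assumes that the 0 placeholders for insufficient data should be excluded before summarizing; it is not the computation used in this report.

```r
# Toy stand-in for CA_gender (made-up rates; 0 marks insufficient data)
toy_gender <- data.frame(
  Stratification1 = c("Male", "Male", "Male", "Female", "Female", "Female"),
  Data_Value      = c(420, 380, 0, 270, 250, 0)
)

# Drop the 0 placeholders so they do not pull the summary down
observed <- toy_gender[toy_gender$Data_Value > 0, ]

# Median death rate per gender category
gender_summary <- aggregate(Data_Value ~ Stratification1, data = observed, FUN = median)
print(gender_summary)
```

A table like `gender_summary` could be passed to knitr::kable() in the same way as the earlier tables.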

# create histogram to find association between race and death rate
p2 <- ggplot(CA_race, mapping = aes(x = Data_Value)) + 
    geom_histogram(mapping = aes(fill = Stratification2), bins = 30) +
    scale_fill_brewer(palette = "BuPu") +
    labs(
      x = "Death rate per 100,000 population",
      y = "Count",
      title = "Histogram of death rate by race in CA")
ggplotly(p2)
p2 <- NULL

Histograms of the death rate for each race/ethnicity were constructed, with insufficient data represented by 0. In the graph, the distributions for the Hispanic and the Asian and Pacific Islander groups seemed to be skewed to the left and had one mode. The distributions for the White, Black, and American Indian and Alaska Native groups seemed to have two modes. Among all races in CA during 2014, the Black group had the highest heart disease death rate. The White and the American Indian and Alaska Native groups were next and had similar distributions of death rate, with the White distribution lying slightly further to the right. The Asian and Pacific Islander group had the fourth highest heart disease death rate. The Hispanic group had the lowest heart disease death rate.

# put male and female rates side by side, one row per county
# dplyr:: is spelled out because select() is masked by raster and plotly
CA_male <- CA_gender[Stratification1 == 'Male'] %>% dplyr::select(LocationDesc, Data_Value, Stratification1)
CA_female <- CA_gender[Stratification1 == 'Female'] %>% dplyr::select(LocationDesc, Data_Value, Stratification1)
CA_joint <- merge(CA_male, CA_female, by = "LocationDesc", all.x = TRUE, all.y = FALSE)
CA_joint <- dplyr::rename(CA_joint, male_mortality = Data_Value.x, female_mortality = Data_Value.y)
# gender gap: male rate minus female rate
CA_joint$Gap <- (CA_joint$male_mortality - CA_joint$female_mortality)
# bubble chart: marker size encodes the gender gap
fig_gendergap <- plot_ly(CA_joint, x = ~male_mortality, y = ~female_mortality, text = ~LocationDesc,
        type = 'scatter', mode = 'markers', size = ~Gap, color = ~LocationDesc, colors = 'Paired',
        sizes = c(7, 30),
        marker = list(opacity = 0.5, sizemode = 'diameter'))
fig_gendergap <- fig_gendergap %>% layout(title = 'Gender gap in heart disease death rate among CA counties',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE))
fig_gendergap

# look up the county FIPS codes for California
df <- read.csv('https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_and_county_fips_master.csv')
fips <- dplyr::filter(df, state == "CA")

CA_gender <- merge(CA_joint, fips, by.x = "LocationDesc", 
             by.y = "name", all.x = TRUE, all.y = FALSE)
# the FIPS codes are read in as numbers without the leading zero, so add it back
CA_gender <- CA_gender %>% 
  mutate(fips = paste0("0", fips))
# county boundaries keyed by FIPS code
url <- 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'
counties <- rjson::fromJSON(file=url)
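County FIPS codes are five characters long, and California codes start with the state prefix 06, so a code stored as a number loses its leading zero and no longer matches the GeoJSON keys. A minimal sketch of a padding step that does not depend on the state prefix is shown below; the vector name `fips_int` is an illustrative assumption.

```r
# Three California county FIPS codes as they look after being read as integers
# (Alameda, Los Angeles, San Diego)
fips_int <- c(6001L, 6037L, 6073L)

# Pad to five characters with leading zeros so the codes match the GeoJSON ids
fips_chr <- sprintf("%05d", fips_int)
print(fips_chr)
```

Unlike pasting a single "0" in front, this padding also works for states whose codes already have five digits.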

Mortality rate pattern across counties in CA

fig_male <- plot_ly(text = CA_gender$LocationDesc)
fig_male <- fig_male %>% add_trace(
    type="choroplethmapbox",
    geojson = counties,
    locations = CA_gender$fips,
    z = CA_gender$male_mortality,
    colorscale="Viridis",
    zmin = 150,
    zmax = 500,
    marker=list(line=list(
      width=0),
      opacity=0.5
    )
  )
fig_male <- fig_male %>% layout(
    mapbox=list(
      style="carto-positron",
      zoom =2,
      center=list(lon= -95.71, lat=37.09))
  )

fig_female <- plot_ly(text = CA_gender$LocationDesc)
fig_female <- fig_female %>% add_trace(
    type="choroplethmapbox",
    geojson = counties,
    locations = CA_gender$fips,
    z = CA_gender$female_mortality,
    colorscale="Viridis",
    zmin = 150,
    zmax = 500,
    marker=list(line=list(
      width=0),
      opacity=0.5
    )
  )
fig_female <- fig_female %>% layout(
    mapbox=list(
      style="carto-positron",
      zoom =2,
      center=list(lon= -95.71, lat=37.09))
  )

Male group mortality rate in CA during 2014

Female group mortality rate in CA during 2014

Within the gender stratification, for females, Kern County, Tulare County and Glenn County had relatively higher heart disease death rates during 2014. For males, Tulare County and Tuolumne County had relatively higher heart disease death rates during 2014. Without any stratification, Kern County and Tulare County had relatively higher heart disease death rates during 2014. Counties along the coast generally had a lower death rate than inland counties. A possible reason is a difference in the level of medical services between counties. In terms of the general trend, males had a higher death rate than females, which the earlier histogram also indicated. A possible reason is a difference in lifestyle.
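Counties with relatively higher rates can also be listed by sorting the joined table. The sketch below uses a made-up stand-in for CA_joint and assumes that rows with a 0 placeholder should be dropped first; the county names and rates are illustrative only.

```r
# Toy stand-in for CA_joint (made-up counties and rates; 0 marks insufficient data)
toy_joint <- data.frame(
  LocationDesc     = c("A County", "B County", "C County", "D County"),
  male_mortality   = c(420, 380, 455, 0),
  female_mortality = c(270, 250, 300, 0)
)

# Keep counties with an observed male rate, then sort from highest to lowest
observed <- toy_joint[toy_joint$male_mortality > 0, ]
ranked   <- observed[order(-observed$male_mortality), c("LocationDesc", "male_mortality")]

# The two counties with the highest male death rate
top_male <- head(ranked, 2)
print(top_male)
```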

# create race subsets
# this part will be updated in the final project
CA_white <- heartdisease_CA[Stratification2 == 'White']
CA_hispanic <- heartdisease_CA[Stratification2 == 'Hispanic']
CA_black <- heartdisease_CA[Stratification2 == 'Black']
CA_asian_pacific <- heartdisease_CA[Stratification2 == 'Asian and Pacific Islander']
CA_indian_alaskan <- heartdisease_CA[Stratification2 == 'American Indian and Alaskan Native']

Within the race stratification, for the White group, Stanislaus County had a relatively higher heart disease death rate during 2014. The death rate fell in the middle range. For the Hispanic group, Kern County had a relatively higher heart disease death rate during 2014. The death rate fell in the lower range. For the Black group, Lassen County, Kings County and Tulare County had relatively higher heart disease death rates during 2014. The death rate fell in the higher range. For the Asian and Pacific Islander group, Mariposa County had a relatively higher heart disease death rate during 2014. The death rate fell in the lower middle range. For the American Indian and Alaska Native group, Shasta County had a relatively higher heart disease death rate during 2014. The death rate fell in the middle range. Possible reasons included an uneven distribution of race/ethnicity, education level, income level and access to medical services.

Conclusion

Both gender and race were associated with the heart disease death rate in California during 2014. For the gender stratification, females generally had a lower death rate than males. Females in Kern County, Tulare County and Glenn County had relatively higher heart disease death rates. Males in Tulare County and Tuolumne County had relatively higher heart disease death rates. Overall, Kern County and Tulare County had relatively higher heart disease death rates. A possible reason is that the level of medical services differed between counties. The overall trend showed that counties along the coast had lower death rates, which might be due to a higher level of development. For the race/ethnicity stratification, the Black group had the highest death rate, especially in Lassen County, Kings County and Tulare County. The White and the American Indian and Alaska Native groups had a middle level of death rate, slightly higher in Stanislaus County and Shasta County respectively. The Asian and Pacific Islander group had a lower middle death rate, slightly higher in Mariposa County. The Hispanic group had the lowest death rate, slightly higher in Kern County. Possible reasons include differences in the distribution of races, education level, income level, healthcare availability and access to medical services.